Analysis of the Air Quality dataset

This document presents the analysis of an air quality dataset, which records the readings of a control station in an Italian city.

Here is reported the explanation for the features present in the dataset:

PT08 measurements are relative to a newer experimental detector, while the other values are from a reference device. We should expect the reference and experimental device to perform quite similarly when analyzing the same molecule.

Note that the reference detectors response is in mg/m^3, while the newer devices are NOT calibrated.

Dataset Exploration

We start the analysis by reading the dataset from a csv file:

air.dataset <- read.csv("~/Scaricati/AirQualityUCI/AirQualityUCI.csv", sep=";")
air.dataset %>% head(50)

We then proceed to clean up the dataset:

# convert comma se
comma.sep.double <- function(column){
    map_dbl( column, ~ as.numeric(sub(",", ".", .))  )
}

# keep the right amount of rows
air.dataset <- air.dataset[1:9357,]

# cleanup
air.dataset <- air.dataset %>% 
    # drop X and X.1 empty columns
    select(-X,-X.1) %>% 
    # create a date object column with dates
    mutate(date = as.Date(Date, "%d/%m/%Y")) %>%
    # split the date colum into day/month/year columns
    separate(Date, c("day", "month", "year")) %>% 
    # simplify Time column and make it numeric
    mutate(Time = map_int(Time, ~ as.integer(strsplit(., ".", fixed = TRUE)[[1]][1]) )) %>% 
    # convert columns to numeric
    mutate(
        CO.GT. = CO.GT. %>% comma.sep.double,
        C6H6.GT. = C6H6.GT. %>% comma.sep.double,
        CO.GT. = CO.GT. %>% comma.sep.double,
        CO.GT. = CO.GT. %>% comma.sep.double,
        T = T %>% comma.sep.double,
        RH = RH %>% comma.sep.double,
        AH = AH %>% comma.sep.double,
    ) %>% 
    # replace -200 with NAs
    mutate_at( 
        c("CO.GT.", "PT08.S1.CO.", "NMHC.GT.", "C6H6.GT.", "PT08.S2.NMHC.", "NOx.GT.", "PT08.S3.NOx.", 
          "NO2.GT.", "PT08.S4.NO2.", "PT08.S5.O3.", "T", "RH", "AH"),  
        ~ifelse(as.character(.) == "-200", NA, .)
    ) %>% mutate_at(c("day", "month", "year"), as.integer)

air.dataset %>% head()

Finally, we print a quick summary of the available data:

skim(air.dataset)
Data summary
Name air.dataset
Number of rows 9357
Number of columns 18
_______________________
Column type frequency:
Date 1
numeric 17
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2004-03-10 2005-04-04 2004-09-21 391

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
day 0 1.00 15.88 8.81 1.00 8.00 16.0 23.00 31.00 ▇▇▇▇▆
month 0 1.00 6.31 3.44 1.00 3.00 6.0 9.00 12.00 ▇▅▅▅▆
year 0 1.00 2004.24 0.43 2004.00 2004.00 2004.0 2004.00 2005.00 ▇▁▁▁▂
Time 0 1.00 11.50 6.92 0.00 5.00 11.0 18.00 23.00 ▇▇▆▇▇
CO.GT. 1683 0.82 2.15 1.45 0.10 1.10 1.8 2.90 11.90 ▇▃▁▁▁
PT08.S1.CO. 366 0.96 1099.83 217.08 647.00 937.00 1063.0 1231.00 2040.00 ▃▇▃▁▁
NMHC.GT. 8443 0.10 218.81 204.46 7.00 67.00 150.0 297.00 1189.00 ▇▂▁▁▁
C6H6.GT. 366 0.96 10.08 7.45 0.10 4.40 8.2 14.00 63.70 ▇▃▁▁▁
PT08.S2.NMHC. 366 0.96 939.15 266.83 383.00 734.50 909.0 1116.00 2214.00 ▅▇▃▁▁
NOx.GT. 1639 0.82 246.90 212.98 2.00 98.00 180.0 326.00 1479.00 ▇▂▁▁▁
PT08.S3.NOx. 366 0.96 835.49 256.82 322.00 658.00 806.0 969.50 2683.00 ▇▇▁▁▁
NO2.GT. 1642 0.82 113.09 48.37 2.00 78.00 109.0 142.00 340.00 ▃▇▃▁▁
PT08.S4.NO2. 366 0.96 1456.26 346.21 551.00 1227.00 1463.0 1674.00 2775.00 ▂▇▇▂▁
PT08.S5.O3. 366 0.96 1022.91 398.48 221.00 731.50 963.0 1273.50 2523.00 ▃▇▅▂▁
T 366 0.96 18.32 8.83 -1.90 11.80 17.8 24.40 44.60 ▂▇▇▃▁
RH 366 0.96 49.23 17.32 9.20 35.80 49.6 62.50 88.70 ▂▆▇▆▂
AH 366 0.96 1.03 0.40 0.18 0.74 1.0 1.31 2.23 ▃▇▇▃▁

We continue by checking the distribution of the measurements of the different sensors (omitting NAs)

cols <- c( "CO.GT.", "PT08.S1.CO.", 
           "NMHC.GT.", "PT08.S2.NMHC.",
           "C6H6.GT.", "PT08.S5.O3.",
           "NOx.GT.", "PT08.S3.NOx.", 
           "NO2.GT.", "PT08.S4.NO2.")

plot_hisogram <- function(df){
    function(col_name){
        col <- col_name
        ggplot(df) +
            geom_histogram(aes(x = df[, col]), alpha=0.7, bins = 30, color="black", fill="lightblue") +
            geom_vline(xintercept = mean(df[, col], na.rm = T), color="red", linetype="dotted") +
            geom_vline(xintercept = median(df[, col], na.rm = T), color="red", linetype="dashed") +
            labs(x=col)
    }
} 

plots <- purrr::map(cols, plot_hisogram(air.dataset))
do.call("grid.arrange", c(plots, ncol=2))

We observe that NMHC.GT. counts are quite low.

The remaining columns are T (temperature), RH (relative humidity) and AH (absolute humidity)

cols <- c("T","RH","AH")
plots <- purrr::map(cols, plot_hisogram(air.dataset))
do.call("grid.arrange", c(plots, ncol=1))

pairs(air.dataset %>% select(T,RH,AH))

From the plot, it looks like the absolute humidity tends to be high when the temperature is high, while the converse holds for relative humidity. This is due to the definition of \(AH\) and \(RH\):

\[AH = \frac{m}{V}\] is the ratio between the mass of water vapor and the volume of air and vapor mixture. \(V\) is dependent by the temperature.

\[RH = \frac{p}{p^*}\] is, intuitively, the ratio of how much vapor the air is currently holding if compared to the maximal amount before condensation. Keep in mind that colder air can keep less vapor; this is an explanation of the negative correlation of RH with the temperature.

We can now compare the correlations of all the available variables:

cormat <- round(cor(air.dataset %>% select(T,RH,AH, dplyr::starts_with("PT08."), dplyr::ends_with("GT.")), use = "complete.obs") ,2)
ggcorrplot(cormat, hc.order = TRUE, outline.col = "white")

NOx new and reference detectors seem to have an opposite output, this might be just an artifact due to the absence of calibration.

pairs(air.dataset %>% select(dplyr::starts_with("PT08.")))

This plots show how the response of the detectors tend to be positively correlated, again with the exception of S3 sensor of the experimental detector.

pairs(air.dataset %>% select(dplyr::ends_with("GT.")))

The reference setup shows instead positive correlation between all sensors.

Temperature

We can consider the temperature starting with a boxplot:

df <- air.dataset %>% mutate(trimester = ((month-1) %/% 3) + 1) 

ggplot(df) +
    geom_boxplot(aes(x=as.factor(month), y=T)) +
    labs(y = "Temperature", x="Month", title="Temperature measurements during the years 2004-2005")

we then divide the observations in four trimester and analyze the temperature distributions:

(plot_hisogram(df))("T") +
    facet_wrap(~ trimester) +
    labs(title="Temperature distribution per trimester")

Quarter 2 and 3 seem to have a bell shaped distribution, we can check if those distributions can be considered normal with qqplots and Kolmogorov-Smirnov tests:

ggplot(df) +
    geom_qq(aes(sample=T)) +
    geom_qq_line(aes(sample=T)) +
    facet_wrap(~ trimester) +
    labs(title="Qqplots for temperature distribution in per trimester")

for(i in 1:4){
    print(i)
    ks.test(df %>% filter(trimester==i) %>% pull(T) %>% scale, pnorm) %>% print
}
## [1] 1
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  df %>% filter(trimester == i) %>% pull(T) %>% scale
## D = 0.055459, p-value = 3.441e-07
## alternative hypothesis: two-sided
## 
## [1] 2
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  df %>% filter(trimester == i) %>% pull(T) %>% scale
## D = 0.068143, p-value = 2.885e-09
## alternative hypothesis: two-sided
## 
## [1] 3
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  df %>% filter(trimester == i) %>% pull(T) %>% scale
## D = 0.067861, p-value = 5.511e-09
## alternative hypothesis: two-sided
## 
## [1] 4
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  df %>% filter(trimester == i) %>% pull(T) %>% scale
## D = 0.038785, p-value = 0.003326
## alternative hypothesis: two-sided

Using the whole dataset:

ks.test(df %>% pull(T) %>% scale, pnorm)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  df %>% pull(T) %>% scale
## D = 0.040003, p-value = 6.368e-13
## alternative hypothesis: two-sided

This shows how no trimester’s temperature seems to be normally distributed. That could easily be caused by the fact huge differences in temperature can be caused simply by weather changes (think for example of comparing rainy and sunny days, a probable cause of the bi-modality shown by histograms).

Comparing pollution

We can now consider if pollution varies among trimesters and/or day of the week. We begin by considering trimesters:

col.names <- df %>% select(dplyr::ends_with("GT.")) %>% colnames()
for(i in col.names){
    df. <- (df %>% select(!!i, trimester)) 
    df. <- df.[complete.cases(df.), ]
    levs <- levels( as.factor(df.$trimester) )
    CN <- as.list(as.data.frame(combn(levs,2)))
    p <- df. %>%
        pivot_longer(dplyr::ends_with("GT."), names_to = "agent", values_to="values")%>% ggplot(aes(x = as.factor(trimester), y=values)) +
        geom_violin() +
        geom_boxplot(width=0.1) +
        labs(title=paste(i, "readout per trimester"), x="trimester", y="readout")
    print(p)
}

From this plot we see that trimester 4 seems to be the more polluted in general, and that we completely miss NMHC measurements for the third and fourth trimesters. Due to the strong correlation between contaminants, it could be possible to estimate NMHC values with a linear regression model, we will consider this possibility when considering how to manage NA values.

We can now consider days of the week, for example we can ask ourselves if pollution is higher during working days, or in the week-end:

df <- df %>% mutate(weekday=factor(weekdays(date),  levels = c("lunedì" , "martedì", "mercoledì", "giovedì", "venerdì", "sabato", "domenica")))
for(i in col.names){
    df. <- (df %>% select(!!i, weekday)) 
    df. <- df.[complete.cases(df.), ]
    p <- df. %>%
            pivot_longer(dplyr::ends_with("GT."), names_to = "agent", values_to="values")%>% ggplot(aes(x = as.factor(weekday), y=values)) +
            geom_violin() +
            geom_boxplot(width=0.1) +
            facet_wrap(~ agent, scales = "free_y") 
    print(p)
}

it seems that pollution is lower on Sunday. We can test this, at least in a first approximation, with a paired t-test on the complete data:

for(i in col.names){
    print(i)
    t.test(df %>% filter(weekday != "domenica") %>% pull(!!i), df %>% filter(weekday == "domenica") %>% pull(!!i)) %>% print
}
## [1] "CO.GT."
## 
##  Welch Two Sample t-test
## 
## data:  df %>% filter(weekday != "domenica") %>% pull(!!i) and df %>% filter(weekday == "domenica") %>% pull(!!i)
## t = 24.785, df = 2161.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7680219 0.9000002
## sample estimates:
## mean of x mean of y 
##  2.272080  1.438069 
## 
## [1] "NMHC.GT."
## 
##  Welch Two Sample t-test
## 
## data:  df %>% filter(weekday != "domenica") %>% pull(!!i) and df %>% filter(weekday == "domenica") %>% pull(!!i)
## t = 13.562, df = 409.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  114.3584 153.1312
## sample estimates:
## mean of x mean of y 
## 232.85941  99.11458 
## 
## [1] "C6H6.GT."
## 
##  Welch Two Sample t-test
## 
## data:  df %>% filter(weekday != "domenica") %>% pull(!!i) and df %>% filter(weekday == "domenica") %>% pull(!!i)
## t = 34.071, df = 3405.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.452374 4.996101
## sample estimates:
## mean of x mean of y 
## 10.775111  6.050873 
## 
## [1] "NOx.GT."
## 
##  Welch Two Sample t-test
## 
## data:  df %>% filter(weekday != "domenica") %>% pull(!!i) and df %>% filter(weekday == "domenica") %>% pull(!!i)
## t = 20.239, df = 2186.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   89.20719 108.34970
## sample estimates:
## mean of x mean of y 
##  260.7831  162.0046 
## 
## [1] "NO2.GT."
## 
##  Welch Two Sample t-test
## 
## data:  df %>% filter(weekday != "domenica") %>% pull(!!i) and df %>% filter(weekday == "domenica") %>% pull(!!i)
## t = 18.18, df = 1635.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.25275 27.63514
## sample estimates:
## mean of x mean of y 
##  116.5992   91.6553

Our hypothesis is strongly supported by all the t-tests.

Quick look at missing data

We can start by exploring how the NA values are distributed in our dataset using the VIM package:

df %>% aggr(cex.axis=.6, numbers=T,cex.numbers=0.6)

We observe that T, RH and AH have the same behavior in terms of NA values. Furthermore, they often relate to absence of data from other sensors.

df %>% select(dplyr::ends_with("GT."), dplyr::starts_with("PT08")) %>% aggr(cex.axis=.6,numbers=T,cex.numbers=0.6)

Here it is more evident that the experimental detector can only be completely offline or fully functional.

par(mfrow=c(2,3))
marginplot(df %>% select(CO.GT., PT08.S1.CO.))
marginplot(df %>% select(NMHC.GT., PT08.S2.NMHC.))
marginplot(df %>% select(NOx.GT., PT08.S3.NOx.))
marginplot(df %>% select(C6H6.GT., PT08.S5.O3.))
marginplot(df %>% select(NO2.GT., PT08.S4.NO2.))

Considering time versus polluting agents

We first transform the date and time of data acquisition into a real number:

df <- df %>% mutate(num_time = (date %>% as.integer()) + (Time/24)) 
"begin of the measurements"
## [1] "begin of the measurements"
df[1, "date"]
## [1] "2004-03-10"
"end of the measurements"
## [1] "end of the measurements"
df[nrow(df), "date"]
## [1] "2005-04-04"

We plot now the measurements of the reference device:

ggplot(df %>% pivot_longer(dplyr::ends_with("GT."), names_to = "agent", values_to="values")) + 
    geom_line(aes(x=num_time, y=values)) + 
    facet_wrap(~ agent, scales = "free_y") +
    geom_smooth(aes(x=num_time, y=values), color="red", linetype="dashed")

and of the newer one:

ggplot(df %>% pivot_longer(dplyr::starts_with("P"), names_to = "agent", values_to="values")) + 
    geom_line(aes(x=num_time, y=values)) + 
    facet_wrap(~ agent, scales = "free_y") +
    geom_smooth(aes(x=num_time, y=values), color="red", linetype="dashed")

We don’t have reference data to calibrate the newer detector.

Managing NAs

We now decide how to cope with NA values in our dataset. From the plots defined in the previous section about NA, we can decide to eliminate cases when more than half of the detectors were shut down.

First of all, we check the operational status of the reference detector when the experimental one is offline:

df %>% filter(is.na(PT08.S1.CO.) & is.na(PT08.S2.NMHC.) & is.na(PT08.S3.NOx.) & is.na(PT08.S4.NO2.) & is.na(PT08.S5.O3.)) %>% 
    select( dplyr::ends_with(".GT.")) %>% 
    mutate_all( ~ ifelse(is.na(.),1,0)) %>%
    select(-NMHC.GT., -C6H6.GT.) %>% 
    group_by_all() %>% 
    dplyr::summarise(n = n()) 

We have observed that C6H6.GT. is always non operational and that NMHC.GT. ceases to function after a while so we excluded them from this table.

When instead the newer detector is online:

df %>% filter(!is.na(PT08.S1.CO.) & !is.na(PT08.S2.NMHC.) & !is.na(PT08.S3.NOx.) & !is.na(PT08.S4.NO2.) & !is.na(PT08.S5.O3.)) %>% 
    select( dplyr::ends_with(".GT.")) %>% 
    mutate_all( ~ ifelse(is.na(.),1,0)) %>%
    select(-NMHC.GT., -C6H6.GT.) %>% 
    group_by_all() %>% 
    dplyr::summarise(n = n()) 

(here C6H6.GT. is always operational)

To manage the NA values in this context, we can eliminate the 49 cases in which the newer and some older detector are offline. We will use instead regression imputation for all the other cases.

# remove 49 cases
df <- df %>% filter(! (is.na(PT08.S1.CO.) & (is.na(CO.GT.) | is.na(NOx.GT.) | is.na(NO2.GT.))))
# apply (stochastic) regression imputation
imp <- mice(df, method = "norm.nob")
## 
##  iter imp variable
##   1   1  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   1   2  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   1   3  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   1   4  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   1   5  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   2   1  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   2   2  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   2   3  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   2   4  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   2   5  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   3   1  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   3   2  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   3   3  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   3   4  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   3   5  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   4   1  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   4   2  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   4   3  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   4   4  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   4   5  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   5   1  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   5   2  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   5   3  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   5   4  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
##   5   5  CO.GT.  PT08.S1.CO.  NMHC.GT.  C6H6.GT.  PT08.S2.NMHC.  NOx.GT.  PT08.S3.NOx.  NO2.GT.  PT08.S4.NO2.  PT08.S5.O3.  T  RH  AH
df_filled <- complete(imp)

We can check now how NMHC.GT. and the other variables were imputed:

ggplot(df_filled %>% select(NMHC.GT., num_time)) + 
    geom_line(aes(x=num_time, y=NMHC.GT.)) +
    geom_smooth(aes(x=num_time, y=NMHC.GT.), color="red", linetype="dashed")

pairs(df_filled %>% select(dplyr::ends_with("GT.")))

Avoiding the use of T, AH and RH columns, the results differ widely for NMHC.GT.:

imp <- mice(df %>% select(dplyr::starts_with("PT08."), dplyr::ends_with("GT."), num_time), method = "norm.nob")
## 
##  iter imp variable
##   1   1  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   1   2  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   1   3  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   1   4  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   1   5  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   2   1  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   2   2  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   2   3  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   2   4  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   2   5  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   3   1  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   3   2  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   3   3  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   3   4  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   3   5  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   4   1  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   4   2  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   4   3  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   4   4  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   4   5  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   5   1  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   5   2  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   5   3  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   5   4  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
##   5   5  PT08.S1.CO.  PT08.S2.NMHC.  PT08.S3.NOx.  PT08.S4.NO2.  PT08.S5.O3.  CO.GT.  NMHC.GT.  C6H6.GT.  NOx.GT.  NO2.GT.
df_test <- complete(imp)
ggplot(df_test %>% select(NMHC.GT., num_time)) + 
    geom_line(aes(x=num_time, y=NMHC.GT.)) +
    geom_smooth(aes(x=num_time, y=NMHC.GT.), color="red", linetype="dashed")

This continuously increasing trend is actually quite unexpected.

plot(df_filled$NMHC.GT., df_test$NMHC.GT.)

> Note that we should not be able to see negative values from this detector

PCA

as the measurement are quite correlated, it makes sense to consider the PCA technique:

df. <-  (df_filled %>% select(-month,-year,-Time, -trimester, -weekday, -day, -date, -num_time)) %>% scale()
pca_data <- df. %>% princomp()
pca_data %>% summary
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.7114946 1.5756635 1.1862325 0.81404621 0.62596292
## Proportion of Variance 0.5656148 0.1909986 0.1082537 0.05098019 0.03014398
## Cumulative Proportion  0.5656148 0.7566135 0.8648672 0.91584741 0.94599139
##                            Comp.6     Comp.7      Comp.8      Comp.9
## Standard deviation     0.45034111 0.39870609 0.324490049 0.308818124
## Proportion of Variance 0.01560222 0.01222951 0.008100393 0.007336837
## Cumulative Proportion  0.96159361 0.97382312 0.981923513 0.989260350
##                            Comp.10     Comp.11     Comp.12      Comp.13
## Standard deviation     0.252257512 0.192255769 0.174414225 0.0926498376
## Proportion of Variance 0.004895438 0.002843558 0.002340276 0.0006603781
## Cumulative Proportion  0.994155788 0.996999346 0.999339622 1.0000000000
cormat <- as.data.frame(cor(df., pca_data$scores))
ggcorrplot(cormat, hc.order = TRUE, outline.col = "white")

# screeplot
pca_var <- pca_data$sdev^2
pve <- pca_var/sum(pca_var)
ggplot(data.frame(pve=pve)) +
    geom_area(aes(y=pve, x=1:length(pve)), fill = "lightblue", color="black", alpha = 0.4 ) +
    geom_node_label(aes(y = pve, x=1:length(pve), label=formatC(cumsum(pve), digits=2, format = "f")) ) +
    labs(title="Screeplot of PCA with Temperature and Humidity", x="# of Components", y = "fraction of explained variance")

# colored by trimester
ggbiplot(pca_data, alpha = 0.1, group = df_filled$trimester %>% as.factor())

# colored by weekday
ggbiplot(pca_data, alpha = 0.1, group = df_filled$weekday %>% as.factor())

These two biplots show how the summer tends to be less polluted and confirm our theories about weekends.

pca_data$loadings
## 
## Loadings:
##               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## CO.GT.         0.347                       0.308  0.294  0.106         0.345
## PT08.S1.CO.    0.347                             -0.358 -0.237  0.547  0.529
## NMHC.GT.       0.253  0.110  0.247  0.699 -0.370  0.357  0.104  0.195 -0.194
## C6H6.GT.       0.353                       0.273         0.196 -0.291       
## PT08.S2.NMHC.  0.357                       0.140 -0.135  0.110 -0.367       
## NOx.GT.        0.291  0.253 -0.103 -0.387  0.298  0.407  0.269  0.230 -0.153
## PT08.S3.NOx.  -0.309                0.297  0.684  0.155 -0.442  0.178 -0.208
## NO2.GT.        0.276  0.269  0.231 -0.306 -0.245  0.272 -0.704 -0.243       
## PT08.S4.NO2.   0.255 -0.406 -0.113  0.271  0.148 -0.103 -0.250 -0.339       
## PT08.S5.O3.    0.344                             -0.435 -0.116  0.289 -0.679
## T                    -0.583  0.258 -0.209         0.167         0.177       
## RH                    0.202 -0.782  0.160         0.127 -0.144 -0.155       
## AH                   -0.532 -0.396 -0.106 -0.168  0.362         0.201 -0.158
##               Comp.10 Comp.11 Comp.12 Comp.13
## CO.GT.         0.741                         
## PT08.S1.CO.   -0.286  -0.135                 
## NMHC.GT.      -0.131                         
## C6H6.GT.      -0.203  -0.383   0.346  -0.588 
## PT08.S2.NMHC. -0.142  -0.134   0.192   0.775 
## NOx.GT.       -0.398   0.221  -0.281         
## PT08.S3.NOx.          -0.114   0.104   0.128 
## NO2.GT.               -0.108                 
## PT08.S4.NO2.           0.358  -0.566  -0.142 
## PT08.S5.O3.    0.313   0.108                 
## T                      0.477   0.496         
## RH                     0.294   0.401         
## AH                    -0.531  -0.141         
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.077  0.077  0.077  0.077  0.077  0.077  0.077  0.077  0.077
## Cumulative Var  0.077  0.154  0.231  0.308  0.385  0.462  0.538  0.615  0.692
##                Comp.10 Comp.11 Comp.12 Comp.13
## SS loadings      1.000   1.000   1.000   1.000
## Proportion Var   0.077   0.077   0.077   0.077
## Cumulative Var   0.769   0.846   0.923   1.000

We repeat the analysis using only detector’s data:

df. <-  (df_filled %>% select(-month,-year,-Time, -trimester, -weekday, -day, -date, -T, -AH, -RH, -num_time)) %>% scale()
pca_data <- df. %>% princomp()
# screeplot
pca_var <- pca_data$sdev^2
pve <- pca_var/sum(pca_var)
ggplot(data.frame(pve=pve)) +
    geom_area(aes(y=pve, x=1:length(pve)), fill = "lightblue", color="black", alpha = 0.4 ) +
    geom_node_label(aes(y = pve, x=1:length(pve), label=formatC(cumsum(pve), digits=2, format = "f")) ) +
    labs(title="Screeplot of PCA with Temperature and Humidity", x="# of Components", y = "fraction of explained variance")

ggbiplot(pca_data, alpha = 0.1, group = df_filled$trimester %>% as.factor()) 

ggbiplot(pca_data, alpha = 0.1, group = df_filled$weekday %>% as.factor())

This shows how T, AH and RH have an impact on the explained variance, as seen before in the loadings of the second component. There are not evident clusters in this representation.

pca_data$loadings
## 
## Loadings:
##               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## CO.GT.         0.348                0.374         0.220  0.375         0.740
## PT08.S1.CO.    0.347                       0.346 -0.519  0.567 -0.282 -0.284
## NMHC.GT.       0.256         0.895         0.238  0.140         0.150 -0.139
## C6H6.GT.       0.353 -0.173         0.218 -0.151  0.201 -0.270 -0.450 -0.163
## PT08.S2.NMHC.  0.356 -0.165               -0.253  0.132 -0.219 -0.401       
## NOx.GT.        0.295  0.441 -0.295  0.369  0.215  0.329         0.380 -0.436
## PT08.S3.NOx.  -0.308         0.223  0.799 -0.155 -0.401 -0.116              
## NO2.GT.        0.280  0.521  0.139 -0.170 -0.689 -0.312         0.111       
## PT08.S4.NO2.   0.249 -0.681 -0.101        -0.290                0.588 -0.139
## PT08.S5.O3.    0.345        -0.155         0.328 -0.484 -0.616  0.175  0.314
##               Comp.10
## CO.GT.               
## PT08.S1.CO.          
## NMHC.GT.             
## C6H6.GT.      -0.658 
## PT08.S2.NMHC.  0.735 
## NOx.GT.              
## PT08.S3.NOx.         
## NO2.GT.              
## PT08.S4.NO2.         
## PT08.S5.O3.          
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings       1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
## Proportion Var    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
## Cumulative Var    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
##                Comp.10
## SS loadings        1.0
## Proportion Var     0.1
## Cumulative Var     1.0

Alternate approach using tSNE for feature reduction:

Colored by weekday:

tsne(df_filled  %>% select(-month,-year,-Time, -trimester, -weekday, -day, -date, -num_time) %>% scale %>% t, labels = df_filled$weekday, dotsize=0.1)

Colored by trimester:

tsne(df_filled %>% select(-month,-year,-Time, -trimester, -weekday, -day, -date, -T, -AH, -RH, -num_time) %>% scale %>% t, labels = df_filled$trimester %>% as.factor, dotsize=0.1)

It is interesting to see a sort of layered clusterization for the trimesters (even without any information about temperature/date).

Other clustering methods

Here we try to see how other clustering methods map on the biplots we have already computed:

  • Hierarchical:
df. <- df_filled %>% select(-month,-year,-Time, -trimester, -weekday, -day, -date, -num_time) 
dist_mat <- dist(df. %>% scale, method = 'euclidian')
hclust_ave <- hclust(dist_mat, method = 'complete')
hclust_ave_cut <- cutree(hclust_ave, 8)
ggbiplot(pca_data, alpha = 0.1, group = hclust_ave_cut %>% as.factor)

K-means:

fviz_nbclust(df. %>% scale, kmeans, method = "silhouette")

kmeans_clusters <- kmeans(df. %>% scale,2)
ggbiplot(pca_data, alpha = 0.1, group = kmeans_clusters$cluster %>% as.factor)

Violin plots with comparisons

Finally, we can use the imputed datasets to complete our violin plots with statistical tests

col.names <- df_filled %>% select(dplyr::ends_with("GT.")) %>% colnames()
for(i in col.names){
    df. <- (df_filled %>% select(!!i, trimester)) 
    df. <- df.[complete.cases(df.), ]
    levs <- levels( as.factor(df.$trimester) )
    CN <- as.list(as.data.frame(combn(levs,2)))
    p <- df. %>%
        pivot_longer(dplyr::ends_with("GT."), names_to = "agent", values_to="values") %>% ggplot(aes(x = as.factor(trimester), y=values)) +
        geom_violin() +
        geom_boxplot(width=0.1) +
        stat_compare_means(comparisons = CN, map_signif_level = TRUE) + 
        labs(title=paste(i, "readout per trimester"), x="trimester", y="readout")
    print(p)
}